Introduction

This is the code used for my exploration of NBC’s 2016 Twitter Troll Dataset. It relies heavily on packages from the tidyverse such as dplyr, purrr, stringr, tidyr, and ggplot2. If you haven’t used many tidyverse packages, then a good resource to learn more about them is Grolemund and Wickham’s R for Data Science. For a particularly good resource on purrr - a package for iterating over lists and vectors - please read Bryan’s purrr tutorial. Also, I do not focus on graph visualization in this tutorial. If you want to learn more about graph/network visualization in R and igraph, then please check out Ognyanova’s AMAZING graph visualization tutorials.

Because the code here depends on so many different packages, I explicitly prepended each function with the package that it came from, in the format package::function. While this is a lot more verbose and adds clutter to the tutorial, it makes clear which functions you are using and where they come from.

Finally, the only package I will load is the magrittr package. This is because its %>% pipe operator makes it easy to order functions logically. Essentially, it allows users to “unnest” functions. For example, the following expression is difficult to read because the logic flows from the innermost function outwards.

unlist(strsplit(toupper('hello world'), ' '))
## [1] "HELLO" "WORLD"

However, written with the %>% pipe, we can reason about the code piecemeal.

library(magrittr)

'hello world' %>%
  toupper() %>%
  strsplit(' ') %>%
  unlist()
## [1] "HELLO" "WORLD"

If the above code confuses you, think of it this way: the output of the first function is the first input of the second function, the output of the second function is the first input of the third function, and so on…

To proceed with this tutorial you should have the following installed:

install.packages(c('dplyr',
                   'ggplot2',
                   'igraph',
                   'jsonlite',
                   'knitr',
                   'magrittr',
                   'purrr',
                   'readr',
                   'scales',
                   'stm',
                   'stringr',
                   'tidyr',
                   'tidytext'))

Explore Twitter Retweet Network

Before anything, let’s load the data and look at what columns we have available.

#load data
tweets <- readr::read_csv('tweets.csv')

names(tweets)
##  [1] "user_id"               "user_key"             
##  [3] "created_at"            "created_str"          
##  [5] "retweet_count"         "retweeted"            
##  [7] "favorite_count"        "text"                 
##  [9] "tweet_id"              "source"               
## [11] "hashtags"              "expanded_urls"        
## [13] "posted"                "mentions"             
## [15] "retweeted_status_id"   "in_reply_to_status_id"

To create a retweeting network, we only need two columns from this data set - user_key and text. We can isolate these two columns with dplyr::select:

retweet_network <- tweets  %>% 
  dplyr::select(user_key, text) %>%
  dplyr::mutate(text = stringr::str_replace_all(text, "\\r|\\n", '') ) #clean text of newlines


retweet_network %>%
  head() %>%
  knitr::kable()
user_key text
ryanmaxwell_1 #IslamKills Are you trying to say that there were no terrorist attacks in Europe before refugees were let in?
detroitdailynew Clinton: Trump should’ve apologized more, attacked less https://t.co/eJampkoHFZ
cookncooks RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet)
queenofthewo RT @jww372: I don’t have to guess your religion! #ChristmasAftermath
mrclydepratt RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711
giselleevns @ModicaGiunta me, too!

Tweets that are actually retweets begin with RT. Because we only care about tweets that were retweeted, we can use dplyr::filter to keep only instances of retweets. stringr::str_detect returns a boolean (TRUE/FALSE) indicating whether the text contains a string we are looking for.

retweet_network <- retweet_network %>%
  dplyr::filter(stringr::str_detect(text, '^RT\\s')) 

retweet_network %>%
  head() %>%
  knitr::kable()
user_key text
cookncooks RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet)
queenofthewo RT @jww372: I don’t have to guess your religion! #ChristmasAftermath
mrclydepratt RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711
baobaeham RT @MDBlanchfield: You’ll never guess who tweeted something false that he saw on TV - The Washington Post https://t.co/K2e4XdXRfu
judelambertusa RT @100PercFEDUP: New post: WATCH: DIAMOND AND SILK Rip On John Kerry Over Israel Comments (VIDEO) https://t.co/NkdKaQ9yYu
ameliebaldwin RT @AriaWilsonGOP: 3 Women Face Charges After Being Caught Stealing Dozens Of Trump Signs https://t.co/JjlZxaW3JN https://t.co/qW2Ok9ROxH

There is a clear pattern in the text: the character string RT is always followed by the Twitter handle that is being retweeted. We can use stringr::str_extract to pull out the Twitter handles and dplyr::mutate to create a new column with this extracted information.

The code below includes (?<=@)[^:]+(?=:). This is a regex string. Click this link to learn more about regex. The other functions are used for cleaning and formatting purposes.
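Before using the pattern in the pipeline, we can sanity-check it on one of the retweets shown above (this check is my own addition, not part of the original pipeline):

```r
# (?<=@)  lookbehind: the match must be preceded by "@"
# [^:]+   one or more characters that are not ":"
# (?=:)   lookahead: the match must be followed by ":"
handle <- stringr::str_extract(
  'RT @ltapoll: Who was/is the best president of the past 25 years?',
  '(?<=@)[^:]+(?=:)'
)
handle
## [1] "ltapoll"
```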

retweet_network <- retweet_network %>%
  dplyr::mutate(retweeted_account = stringr::str_extract(text, '(?<=@)[^:]+(?=:)') %>%
                  stringr::str_to_lower()) %>% #standardize twitter handles to lower case
  dplyr::filter(!is.na(retweeted_account)) %>% #remove text that starts with "RT" but aren't actually retweets
  dplyr::select(user_key, retweeted_account, text) %>% #reorder columns
  dplyr::distinct()

retweet_network %>%
  head() %>%
  knitr::kable()
user_key retweeted_account text
cookncooks ltapoll RT @ltapoll: Who was/is the best president of the past 25 years? (Vote & Retweet)
queenofthewo jww372 RT @jww372: I don’t have to guess your religion! #ChristmasAftermath
mrclydepratt shareblue RT @Shareblue: Pence and his lawyers decided which of his official emails the public could seehttps://t.co/HjhPguBK1Y by @alisonrose711
baobaeham mdblanchfield RT @MDBlanchfield: You’ll never guess who tweeted something false that he saw on TV - The Washington Post https://t.co/K2e4XdXRfu
judelambertusa 100percfedup RT @100PercFEDUP: New post: WATCH: DIAMOND AND SILK Rip On John Kerry Over Israel Comments (VIDEO) https://t.co/NkdKaQ9yYu
ameliebaldwin ariawilsongop RT @AriaWilsonGOP: 3 Women Face Charges After Being Caught Stealing Dozens Of Trump Signs https://t.co/JjlZxaW3JN https://t.co/qW2Ok9ROxH

We don’t necessarily care about the individual tweets. What we really care about is who a troll retweeted and how often. We can use dplyr::count to aggregate the total number of times a particular user_key retweeted another account.

retweet_network_wt <- retweet_network %>%
  dplyr::count(user_key, retweeted_account, sort = T)

retweet_network_wt %>%
  head() %>%
  knitr::kable()
user_key retweeted_account n
paulinett blicqer 1049
melanymelanin blicqer 798
paulinett zaibatsunews 222
giselleevns chrixmorgan 218
brianaregland feministajones 200
hyddrox cmdorsey 178

This is a data frame with 81862 rows. This data frame represents an edge list, and so we may have too many edges to get a good understanding of our data. We can thin the edges by choosing a cutoff. That is, if we assume that a troll retweeting an account less than 5 times is insignificant for our analysis, then we can remove those edges and clean up our graph.

filter_n <- 5

retweet_network_wt <- retweet_network_wt  %>%
  dplyr::filter(n >= filter_n) 

retweet_network_wt %>%
  head() %>%
  knitr::kable()
user_key retweeted_account n
paulinett blicqer 1049
melanymelanin blicqer 798
paulinett zaibatsunews 222
giselleevns chrixmorgan 218
brianaregland feministajones 200
hyddrox cmdorsey 178

Cool, now we have only 3675 edges in our graph. Let’s now convert the edge list data frame into an actual igraph graph. We can do this using igraph::graph_from_data_frame.

g_rtwt <- igraph::graph_from_data_frame(retweet_network_wt)

summary(g_rtwt)
## IGRAPH 41eff6f DN-- 1555 3675 -- 
## + attr: name (v/c), n (e/n)

The D at the top of the summary stands for Directed graph - that is, direction matters for the edges. The N stands for Named graph - that is, each node has a unique name. We can add metadata to the graph directly with $ notation, similar to how we would in a list. Any valid name for a list or data frame will be a valid name for a graph attribute. The number of nodes (1555) and the number of edges (3675) follows the graph’s metadata. We are also given a list of vertex attributes (prefixed v/) and edge attributes (prefixed e/), where the trailing letter gives the type: c(haracter), n(umeric), or l(ogical).

g_rtwt$name <- '2016 Russian Twitter Troll Retweet Network'
g_rtwt$info <- "A graph inspired by NBC's and Neo4j's exploration."

summary(g_rtwt)
## IGRAPH 41eff6f DN-- 1555 3675 -- 2016 Russian Twitter Troll Retweet Networ
## + attr: name (g/c), info (g/c), name (v/c), n (e/n)

The name attribute is a special attribute for a graph and is shown in the summary. We can use $ notation to retrieve other graph attributes.

g_rtwt$info
## [1] "A graph inspired by NBC's and Neo4j's exploration."

If we want to plot a graph, then we simply need to use the plot function. For igraph graphs, plot takes special parameters to manipulate the vertices and edges in the plot. We will not go into depth about plotting graphs, but if you want to learn more, then please visit Katya Ognyanova’s detailed tutorials on graph visualization. You can also run ?igraph.plotting to learn more.

set.seed(4321)
plot(
  g_rtwt,
  vertex.size = 2,
  vertex.label = '',
  edge.arrow.size = .05,
  edge.width = .25,
  asp = 0 #aspect ratio
)

igraph can readily calculate many different centrality measurements such as igraph::betweenness, igraph::degree, and igraph::eigen_centrality. We will focus on igraph::page_rank. These functions return scores at the node level, and the order of the scores corresponds with the order of the vertices.

pr <- igraph::page_rank(g_rtwt)$vector
head(pr)
##     paulinett melanymelanin   giselleevns brianaregland       hyddrox 
##  0.0005891945  0.0005891945  0.0007005381  0.0005891945  0.0005891945 
##    tpartynews 
##  0.0018363397
head(igraph::V(g_rtwt)$name)
## [1] "paulinett"     "melanymelanin" "giselleevns"   "brianaregland"
## [5] "hyddrox"       "tpartynews"

Because these measurements are in the same order as the vertices, we can store the measurements as a vertex attribute.

igraph::V(g_rtwt)$PageRank <- pr
igraph::V(g_rtwt)[[1:6]]
## + 6/1555 vertices, named, from 41eff6f:
##            name     PageRank
## 1     paulinett 0.0005891945
## 2 melanymelanin 0.0005891945
## 3   giselleevns 0.0007005381
## 4 brianaregland 0.0005891945
## 5       hyddrox 0.0005891945
## 6    tpartynews 0.0018363397

If we want to match the vertex information with outside information, the easiest way to do that is to convert the vertex list into a data frame with igraph::as_data_frame and then join it with the new data using one of dplyr’s *_join functions. Let’s combine the vertex list with each troll’s total number of tweets.

vertex_df <- igraph::as_data_frame(g_rtwt, 'vertices') %>%
  dplyr::arrange(desc(PageRank))

edges_df <- igraph::as_data_frame(g_rtwt, 'edges')
  
total_tweets <- tweets %>%
  dplyr::select(user_key, text) %>%
  dplyr::count(user_key) %>%
  dplyr::rename(TotalTweets = n)

vertex_df <- dplyr::left_join(vertex_df, total_tweets, by = c('name' = 'user_key'))

vertex_df %>%
  head() %>%
  knitr::kable()
name PageRank TotalTweets
rt_com 0.0050983 NA
blacktolive 0.0039910 238
gloed_up 0.0038694 327
chiefplan1 0.0038029 NA
rt_america 0.0035941 NA
ten_gop 0.0029589 3194

If the TotalTweets of a node is NA, then the account is not listed in the list of trolls. This means the Twitter trolls are retweeting tweets from real accounts. Let’s recreate the network and use the TotalTweets vertex attribute as something to filter on. We can actually remove vertices from a graph with -.

g_rtwt <- igraph::graph_from_data_frame(edges_df, T, vertex_df) %>%
  {. - igraph::V(.)[is.na(TotalTweets)]} %>%
  {. - igraph::V(.)[igraph::degree(.) == 0]} #remove unconnected nodes

summary(g_rtwt)
## IGRAPH 5dee1a6 DN-- 83 153 -- 
## + attr: name (v/c), PageRank (v/n), TotalTweets (v/n), n (e/n)

Let’s re-plot the graph.

set.seed(4321)
g_rtwt %>%
  plot(
    vertex.size = igraph::V(.)$PageRank/max(igraph::V(.)$PageRank) * 5 + 2,
    vertex.label = '',
    edge.arrow.size = .05,
    edge.width = .25,
    asp = 0 #aspect ratio
  )

igraph has a number of community detection algorithms to use, including igraph::infomap.community, igraph::spinglass.community, and igraph::fastgreedy.community. Here, we will use igraph::walktrap.community.

g_community <- igraph::walktrap.community(graph = g_rtwt)
names(g_community)
## [1] "merges"     "modularity" "membership" "names"      "vcount"    
## [6] "algorithm"

The community membership is listed in the same order as the vertices. This means we can store the membership as a vertex attribute. The communities are represented as numbers; this particular graph has max(g_community$membership) communities. We can create a color palette for these communities.

igraph::V(g_rtwt)$community <- g_community$membership

community_pal <- scales::brewer_pal('qual')(max(igraph::V(g_rtwt)$community))
names(community_pal) <- 1:max(igraph::V(g_rtwt)$community)

community_pal
##         1         2         3         4         5 
## "#7FC97F" "#BEAED4" "#FDC086" "#FFFF99" "#386CB0"

color is a special vertex attribute. If it exists, then the color stored in the vertex is automatically used when plotting. Let’s iterate over the vertices and assign each a color according to its community.

igraph::V(g_rtwt)$color <- purrr::map_chr(igraph::V(g_rtwt)$community, function(x){
  community_pal[[x]]
})

set.seed(4321)
g_rtwt %>%
  plot(
      vertex.size = igraph::V(.)$PageRank/max(igraph::V(.)$PageRank) * 5 + 2,
      vertex.label = '',
      edge.arrow.size = .05,
      edge.width = .25,
      asp = 0 #aspect ratio
  )

Let’s take a moment to actually analyze the hashtags associated with these users. We can go back to our original dataset and try to match users with hashtags.

tweet_hashtag <- tweets %>% 
  dplyr::select(user_key, hashtags, text) %>%
  dplyr::distinct()%>% 
  dplyr::filter(hashtags != '[]') %>% #this represents a tweet with no hashtags
  dplyr::mutate(hashtags = purrr::map(hashtags, jsonlite::fromJSON)) %>% #the stored info is a json file
  tidyr::unnest()  %>%
  dplyr::select(user_key, hashtags)

Join the hashtag data with the vertex data, then aggregate:

vertex_df <- igraph::as_data_frame(g_rtwt, 'vertices')
edge_df <- igraph::as_data_frame(g_rtwt, 'edges')

community_hashtags <- vertex_df %>%
  dplyr::left_join(tweet_hashtag, by = c('name' = 'user_key')) %>%
  dplyr::count(community, hashtags, sort = T)  %>%
  dplyr::group_by(community) %>%
  dplyr::top_n(6, wt = n) %>%
  dplyr::ungroup()

community_hashtags %>%
  head() %>%
  knitr::kable()
community hashtags n
4 maga 1598
4 Trump 1038
4 tcot 877
2 news 854
4 NeverHillary 747
3 RejectedDebateTopics 570
community_hashtags %>%
  dplyr::mutate(hashtags = purrr::map2_chr(hashtags, community, ~paste0(paste0(rep(' ', as.numeric(.y)), collapse = ''), .x))) %>%
  dplyr::arrange(desc(n)) %>%
  dplyr::mutate(hashtags = factor(hashtags, unique(hashtags))) %>%
  ggplot2::ggplot(ggplot2::aes(x = hashtags, y = n, fill = as.character(community))) +
  ggplot2::geom_col(color = 'black') +
  ggplot2::facet_wrap(~community, scales = 'free') +
  ggplot2::scale_fill_manual(values = community_pal) +
  ggplot2::coord_flip()+
  ggplot2::theme_bw() +
  ggplot2::theme(legend.position="none") +
  ggplot2::labs(
    x = ''
  )

Explore the usage of hashtags

Let’s revisit the tweet_hashtag data frame we created earlier.

head(tweet_hashtag) %>%
  knitr::kable()
user_key hashtags
ryanmaxwell_1 IslamKills
queenofthewo ChristmasAftermath
hiimkhloe Blacklivesmatter
jasper_fly MyFarewellWordsWouldBe
giselleevns My2017BiggestHope
pamela_moore13 Obama

If the first two columns of a data frame represent the two connected nodes in an edge list, then this data frame represents a bipartite network in which one node type is user and the other node type is hashtag.
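
The construction code isn’t shown above, so here is a minimal sketch of how such a weighted bipartite edge list and graph could be built. The toy input data frame, the ‘@’/‘#’ prefixes, and the names tweet_hashtag_edges and tweet_hashtag_g are assumptions modeled on the output shown below:

```r
library(magrittr)

# Toy stand-in for the tweet_hashtag data frame created earlier
tweet_hashtag <- dplyr::tibble(
  user_key = c('newspeakdaily', 'newspeakdaily', 'ameliebaldwin'),
  hashtags = c('politics', 'politics', 'maga')
)

# Count each user -> hashtag pair to weight the edges, label the edge type,
# and prefix the names so user and hashtag nodes cannot collide
tweet_hashtag_edges <- tweet_hashtag %>%
  dplyr::count(user_key, hashtags) %>%
  dplyr::rename(weight = n) %>%
  dplyr::mutate(user_key = paste0('@', user_key),
                hashtags = paste0('#', hashtags),
                type = 'used_hashtag')

# Build a directed graph and mark each vertex with its node type
tweet_hashtag_g <- igraph::graph_from_data_frame(tweet_hashtag_edges)
igraph::V(tweet_hashtag_g)$type <- ifelse(
  igraph::V(tweet_hashtag_g)$name %in% tweet_hashtag_edges$user_key,
  'user', 'hashtag'
)
```

The summary below also lists color and size vertex attributes, which would be set separately for plotting.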

user_key hashtags weight type
@newspeakdaily #politics 735 used_hashtag
@ameliebaldwin #maga 597 used_hashtag
@kansasdailynews #news 573 used_hashtag
@onlinecleveland #politics 501 used_hashtag
@washingtonline #politics 393 used_hashtag
@todaybostonma #politics 377 used_hashtag
## IGRAPH 34cf1dc DNWB 13336 36366 -- 
## + attr: name (v/c), type (v/c), color (v/c), size (v/n), weight
## | (e/n), type (e/c)

Again, to minimize the number of edges, we counted each user_key -> hashtag connection. We only want to keep the edges that “matter,” so we remove connections that occur fewer than a certain number of times.

filter_n <- 6

tweet_hashtag_g <- tweet_hashtag_g %>%
  {. - igraph::E(.)[weight < filter_n]} %>%   #remove edges with weight less than the cutoff
  {. - igraph::V(.)[igraph::degree(.) == 0]}  #remove vertices that have no connections

summary(tweet_hashtag_g)
## IGRAPH 7ce09b6 DNWB 1027 3732 -- 
## + attr: name (v/c), type (v/c), color (v/c), size (v/n), weight
## | (e/n), type (e/c)
set.seed(4321)
tweet_hashtag_g %>%
  plot( 
     vertex.label = '',
     edge.arrow.mode = '-',
     edge.width = .1,
     layout = igraph::layout_as_bipartite(., types = igraph::V(.)$type == 'hashtag'),
     asp = 0
     )

The cool thing about bipartite graphs is that we can derive single-type graphs from them. That is, we can project a new graph by connecting two nodes of the same type through the neighboring nodes they share.

tweet_hashtag_g %>%
  {
    igraph::V(.)$type <- igraph::V(.)$type == 'user'; #boolean type is necessary for bipartite projection
    igraph::V(.)$degree <- igraph::degree(.); #the total number of connected edges
    igraph::V(.)$weighted_degree <- igraph::strength(.); #the sum of connected edge weights
    .
  } %>%
  igraph::bipartite_projection() %>%
  {
    . <- purrr::map(., function(x){
      igraph::V(x)$component <-igraph::components(x)$membership;
      return(x)
    })
    hashtag_g <<- .$proj1; #projection 1 is type FALSE (hashtags)
    user_g <<- .$proj2; #projection 2 is type TRUE (user)
  }
## Warning in igraph::bipartite_projection(.): vertex types converted to
## logical

By projecting the bipartite graph we now have two graphs. A graph where users are indirectly connected to other users by the hashtags they both use…

set.seed(4321)
plot(user_g, vertex.size = 3, vertex.label = '', asp = 0)

and a graph where hashtags are indirectly connected to other hashtags by the users that use them.

This projected graph of hashtags has multiple components; for now, let’s only examine the largest component. We can see which vertex belongs to which component with igraph::components. We can then group the components together with igraph::groups and remove all nodes that do not belong to the largest component.

hashtag_g_components <- igraph::components(hashtag_g)

largestComponent <- hashtag_g_components %>%
  igraph::groups() %>%
  {
    maxL <- max(purrr::map_dbl(., length));
    .[purrr::map_lgl(., function(x){length(x) == maxL})] %>%
      unlist()
  } %>%
  {hashtag_g - igraph::V(hashtag_g)[!name %in% .]}
  

plot(largestComponent, vertex.size = 3, vertex.label = '', edge.width = .1, asp = 0)

We can now return to the measurements and community detection algorithms we used earlier.

igraph::V(largestComponent)$community <- igraph::walktrap.community(largestComponent)$membership
igraph::V(largestComponent)$PageRank <- igraph::page_rank(largestComponent)$vector

hashtag_pal <- scales::brewer_pal('qual')(max(igraph::V(largestComponent)$community))
names(hashtag_pal) <- as.character(1:length(hashtag_pal)) #named vector is useful for ggplot2

igraph::V(largestComponent)$color <- purrr::map_chr(igraph::V(largestComponent)$community, ~hashtag_pal[.x])

largestComponent %>%
  plot(
    vertex.label= '',
    vertex.size = igraph::V(.)$PageRank/max(igraph::V(.)$PageRank) * 5 + 2,
    edge.width = .1,
    asp = 0
  )

Let’s take a look at what hashtags are placed together in the same communities.

We’ve already identified the different communities; now let’s examine how the different communities interact with each other. We can create super nodes that represent the communities. Edges in this new graph represent people who used hashtags from two different communities. We can do this by converting the graph back to a data frame and combining the hashtags that belong to the same community together.

First, we want to tag each edge with the communities of both of its nodes. This will be helpful in determining which edges connect two different communities.

largestComponent_summary <- largestComponent %>%
  {
    igraph::E(.)$tailCommunity <- igraph::tail_of(., igraph::E(.))$community; #store node information in  
    igraph::E(.)$headCommunity <- igraph::head_of(., igraph::E(.))$community; #edges to reference later
    .
  } 

igraph::E(largestComponent_summary)[[1:6]]
## + 6/64921 edges from 80f156e (vertex names):
##        tail       head tid hid weight tailCommunity headCommunity
## 1 #politics    #health   1 339      5             4             4
## 2 #politics      #news   1   3     19             4             4
## 3 #politics #cleveland   1  23      1             4             4
## 4 #politics    #sports   1 128     12             4             4
## 5 #politics     #money   1 406      3             4             4
## 6 #politics  #business   1 219      8             4             4

Then we convert the graph into a data frame and keep only the edges connecting two different communities. We then want to create the top_hashes edge attribute - a single string summarizing which hashtags people commonly use from both communities.

vertex_df <- igraph::as_data_frame(largestComponent_summary, 'vertices')
edge_df <- igraph::as_data_frame(largestComponent_summary, 'edges')

edge_df <- edge_df %>%
  dplyr::filter(tailCommunity != headCommunity) %>%# we only want edges connecting two different communities 
  #1-6 is the same as 6-1, so let's remove duplicates
  dplyr::mutate(
    tail = purrr::map2_dbl(tailCommunity, headCommunity, min) %>% round %>% as.character(),
    head = purrr::map2_dbl(tailCommunity, headCommunity, max) %>% round %>% as.character()
  ) %>%
  dplyr::group_by(tail, head) %>%
  tidyr::nest(.key = 'top_hashes') %>%
  dplyr::mutate(top_hashes = purrr::map(top_hashes, function(x){
    top_hashes <- x %>%
      dplyr::arrange(dplyr::desc(weight)) %>%
      head(3) %>%
      {paste(.$from, .$to, sep = ' | ', collapse = '\n')}

    })) %>%
  tidyr::unnest()

edge_df
## # A tibble: 8 x 3
##   tail  head  top_hashes                                                   
##   <chr> <chr> <chr>                                                        
## 1 4     6     "#maga | #thingsmoretrustedthanhillary\n#trump | #thingsmore…
## 2 1     4     "#nowplaying | #blacklivesmatter\n#nowplaying | #staywoke\n#…
## 3 3     4     "#trump | #merkelmussbleiben\n#trump | #merkel\n#trump | #mo…
## 4 2     4     "#nodapl | #indigenous\n#nodapl | #tairp\n#trump | #indigeno…
## 5 4     7     "#debate | #podernfamily\n#debate | #hiddenfigures\n#debate …
## 6 1     7     "#nowplaying | #podernfamily\n#nowplaying | #hiddenfigures\n…
## 7 3     8     #merkel | #jugendmitmerkel                                   
## 8 4     5     "#syria | #us\n#isis | #syria\n#isis | #iraq"

We want to make a similar vertex attribute to give us an idea of which hashtags belong to each community. We should also take this opportunity to color the nodes according to the community palette.

vertex_df <- vertex_df %>%
  dplyr::mutate(community = as.character(community)) %>%
  dplyr::group_by(community) %>%
  tidyr::nest() %>%
  dplyr::mutate(top_hashes = purrr::map_chr(data, function(x){
    x %>%
      head %>%
      .$name %>%
      paste(collapse = '\n')
    })) %>%
  dplyr::select(-data) %>%
  dplyr::mutate(color = purrr::map_chr(community, function(x){hashtag_pal[x]}))

vertex_df
## # A tibble: 8 x 3
##   community top_hashes                                               color 
##   <chr>     <chr>                                                    <chr> 
## 1 4         "#politics\n#maga\n#news\n#trump\n#neverhillary\n#topvi… #FFFF…
## 2 6         "#makemehateyouinonephrase\n#sometimesitsokto\n#idrunfo… #F002…
## 3 3         "#merkelmussbleiben\n#merkel\n#g20\n#morgen\n#merkelser… #FDC0…
## 4 1         "#nowplaying\n#listen2\n#god\n#soundcloud\n#np\n#tashif… #7FC9…
## 5 7         "#podernfamily\n#hiddenfigures\n#podcast\n#scifisunday\… #BF5B…
## 6 5         "#syria\n#iraq\n#mosul\n#aleppo\n#saa"                   #386C…
## 7 8         #jugendmitmerkel                                         #6666…
## 8 2         "#indigenous\n#tairp\n#blackunity\n#msnbc\n#morningjoe"  #BEAE…

We can now build the summary graph and plot its structure.

set.seed(4321)
largestComponent_summary <- igraph::graph_from_data_frame(edge_df, F, vertex_df)
plot(largestComponent_summary, asp = 0)

We can replace the nodes with text that summarizes the hashtags found in the community.

We can also label the edges with the hashtags that connect the communities.
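
Both ideas can be sketched with base igraph plotting via the standard vertex.label and edge.label parameters (see ?igraph.plotting). The toy two-community graph below stands in for largestComponent_summary, and the label sizes are guesses:

```r
# Toy stand-in for largestComponent_summary, using values from the
# tables above (communities 4 and 6 and their top_hashes strings)
g_summary <- igraph::graph_from_data_frame(
  data.frame(from = '4', to = '6',
             top_hashes = '#maga | #thingsmoretrustedthanhillary'),
  directed = FALSE,
  vertices = data.frame(name = c('4', '6'),
                        top_hashes = c('#politics\n#maga\n#trump',
                                       '#makemehateyouinonephrase'))
)

# vertex.label replaces the node circles with the community summaries;
# edge.label prints the cross-community hashtag pairs at the edge midpoints
set.seed(4321)
plot(
  g_summary,
  vertex.size = 0,
  vertex.label = igraph::V(g_summary)$top_hashes,
  edge.label = igraph::E(g_summary)$top_hashes,
  vertex.label.cex = .8,
  edge.label.cex = .6,
  asp = 0
)
```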

Bonus: Topic Modelling

We won’t conduct a full text analysis of these tweets, but it is worth mentioning that in topic modelling we are often tasked with the tokenization of text - that is, we need to split the text into single words. We are also tasked with the removal of stop words - junk words that only add noise to the model. The interesting thing about our analysis is that hashtags serve as a kind of pre-tokenized text. They also don’t need to be cleaned, because all hashtags, by virtue of being explicitly created, are important.
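
For contrast, here is a minimal sketch of what tokenization and stop-word removal would look like for raw tweet text, using tidytext and one of the tweets shown earlier (tidytext::stop_words is the package’s bundled stop-word list; this snippet is illustrative and not part of the analysis):

```r
library(magrittr)

# Split raw tweet text into lowercased word tokens, then drop stop
# words - the two cleaning steps that hashtags let us skip
tokens <- dplyr::tibble(
    text = 'RT @ltapoll: Who was/is the best president of the past 25 years?'
  ) %>%
  tidytext::unnest_tokens(word, text) %>%              # one row per token
  dplyr::anti_join(tidytext::stop_words, by = 'word')  # remove noise words

tokens$word
```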

We will proceed to create a topic model with these hashtags. If you want to dig deeper into the topic modelling world, then I highly recommend reading Silge and Robinson’s Tidy Text Mining in R.

Let’s revisit the user -> hashtag edge list we created earlier and select edges that only belong in the larger component we just explored.

tweet_hashtag_edges <- tweet_hashtag_edges %>%
  dplyr::filter(hashtags %in% igraph::V(largestComponent)$name)  %>%
  dplyr::select(-type)

With this data frame we can create something called a document term matrix. This is a matrix where the documents are the rows, the terms are the columns, and their co-occurrence is stored at their intersection. Let’s create one:

tweets_sparse_hash <- tweet_hashtag_edges %>%
  tidytext::cast_sparse(user_key, hashtags, weight)

tweets_sparse_hash[1:10, 1:5] 
## 10 x 5 sparse Matrix of class "dgCMatrix"
##                  #politics #maga #news #makemehateyouinonephrase #trump
## @newspeakdaily         735     .     .                         .      .
## @ameliebaldwin           2   597     6                         .    322
## @kansasdailynews       326     .   573                         .      .
## @onlinecleveland       501     .     .                         .      .
## @washingtonline        393     .     8                         .      .
## @todaybostonma         377     .     2                         .      .
## @hyddrox                 1   363     6                         .    280
## @giselleevns             .     .     .                       334      6
## @batonrougevoice       315     .     9                         .      .
## @specialaffair           .     .   190                         .      .

The beautiful thing about this kind of matrix is that there are a number of topic modelling functions that can work with it. Latent Dirichlet Allocation (LDA) is one that is frequently used. However, we will work with Structural Topic Models (STM). If you want to learn more about STM, you can check out the package authors’ site, which contains a number of great references. Let’s use STM to identify 8 topics in our text. I chose 8 to match the number of communities we found earlier.

## Model takes a minute.
## I just pre-ran it for you.
# set.seed(4321)
# topic_model_hash <- stm::stm(tweets_sparse_hash, K = 8,
#                              verbose = FALSE, init.type = "Spectral")
# readr::write_rds(topic_model_hash, 'twitter_troll_topic_model_8.rds')

topic_model_hash <- readr::read_rds("twitter_troll_topic_model_8.rds")

summary(topic_model_hash)
##             Length Class  Mode     
## mu             2   -none- list     
## sigma         49   -none- numeric  
## beta           1   -none- list     
## settings      13   -none- list     
## vocab        846   -none- character
## convergence    4   -none- list     
## theta       1456   -none- numeric  
## eta         1274   -none- numeric  
## invsigma      49   -none- numeric  
## time           1   -none- numeric  
## version        1   -none- character

We can grab the beta values - the probability that a term (here a hashtag) belongs to a topic.

td_beta_hash <- tidytext::tidy(topic_model_hash)

td_beta_hash %>%
  head(8) %>%
  knitr::kable()
topic term beta
1 #politics 0.9028386
2 #politics 0.0000000
3 #politics 0.0000000
4 #politics 0.0000000
5 #politics 0.0000000
6 #politics 0.0000000
7 #politics 0.0000000
8 #politics 0.0000000

Here we see that #politics has a strong probability of belonging to topic 1. It’s important to note that a term can belong to many topics. The beta simply tells us how likely we are to see a particular word in a particular topic. Let’s explore the words most closely related to each topic.

td_beta_hash %>%
  dplyr::group_by(topic) %>%
  dplyr::top_n(5, beta) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(dplyr::desc(beta)) %>%
  dplyr::mutate(term = purrr::map2_chr(term, topic, ~paste0(paste0(rep(' ', as.numeric(.y)), collapse = ''), .x))) %>%
  dplyr::mutate(term = factor(term, unique(term))) %>%
  dplyr::mutate(topic = paste0("Topic ", topic)) %>%
  ggplot2::ggplot(ggplot2::aes(term, beta, fill = as.factor(topic))) +
  ggplot2::geom_col(alpha = 0.8, show.legend = FALSE, color = 'black') +
  ggplot2::facet_wrap(~ topic, scales = "free") +
  ggplot2::coord_flip() +
  ggplot2::labs(x = NULL, y = expression(beta),
       title = "Grouping of Hashtags: Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics") +
  ggplot2::scale_fill_brewer(type = 'qual') +
  ggplot2::theme_bw()

Now, like we did with the community detection algorithm earlier, we can use these topics to mark or color our graph. Again, while it is possible that a word can be strongly related to multiple topics, we will need to choose one topic to color each hashtag. To do this, we will choose the topic associated with the word’s highest beta.

markedTopic <- td_beta_hash %>%
  dplyr::group_by(term) %>%
  dplyr::top_n(1, wt = beta) %>%
  dplyr::select(term, topic, beta) 

markedTopic %>%
  head %>%
  knitr::kable()
term topic beta
#politics 1 0.9028386
#maga 7 0.1008277
#news 4 0.3310762
#makemehateyouinonephrase 5 0.0544027
#trump 7 0.0885790
#neverhillary 6 0.0458019
topicLargestComponent <- largestComponent %>%
  igraph::as_data_frame('both') %>%
  {
    .$vertices <- dplyr::left_join(.$vertices, markedTopic, by = c('name' = 'term')) %>%
      dplyr::mutate(color = purrr::map_chr(topic, ~hashtag_pal[.x]));
    
    igraph::graph_from_data_frame(.$edges, F, .$vertices)
  }

plot(topicLargestComponent, vertex.label = '', asp = 0, vertex.size = 3, edge.width = .1)

Conclusion

The really cool thing about working with Twitter data sets is that you can explore a lot of different connections. You can follow a retweet network, you can see how hashtags relate to each other, and you can even explore how different people follow one another. I’m not sure if the analysis of this particular dataset taught us anything new about the Russian Twitter trolls, but it did give us an opportunity to see how to graph networks in R and to see how the igraph package functions within the greater R ecosystem. I hope this tutorial helped you learn something.

Cheers,

Ben